Tree-Based State Tying for High Accuracy Modelling

Authors

  • Steve J. Young
  • J. J. Odell
  • Philip C. Woodland
Abstract

The key problem to be faced when building a HMM-based continuous speech recogniser is maintaining the balance between model complexity and available training data. For large vocabulary systems requiring cross-word context dependent modelling, this is particularly acute since many such contexts will never occur in the training data. This paper describes a method of creating a tied-state continuous speech recognition system using a phonetic decision tree. This tree-based clustering is shown to lead to similar recognition performance to that obtained using an earlier data-driven approach but to have the additional advantage of providing a mapping for unseen triphones. State-tying is also compared with traditional model-based tying and shown to be clearly superior. Experimental results are presented for both the Resource Management and Wall Street Journal tasks.

1. INTRODUCTION

Hidden Markov Models (HMMs) have proved to be an effective basis for modelling time-varying sequences of speech spectra. However, in order to accurately capture the variations in real speech spectra (both inter-speaker and intra-speaker), it is necessary to have a large number of models and to use relatively complex output probability distributions. For example, to achieve good performance in a continuous density HMM system, it is necessary to use mixture Gaussian output probability distributions together with context dependent phone models. In practice, this creates a data insufficiency problem due to the resulting large number of model parameters. Furthermore, the data is usually unevenly spread so that some method is needed to balance model complexity against data availability.

This data insufficiency problem becomes acute when a system incorporating cross-word context dependency is used. Because of the large number of possible cross-word triphones, there are many models to estimate and a large number of these triphones will have few, if any, occurrences in the training data.
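To illustrate why cross-word context inflates the model inventory, the following minimal sketch counts the distinct triphones required by a toy lexicon with and without cross-word context. The lexicon, phone symbols, and function names are hypothetical and are not taken from the paper; silence models and grammar constraints are ignored.

```python
from itertools import product

# Hypothetical toy lexicon (illustrative only): word -> phone sequence.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "at": ["ae", "t"],
    "tack": ["t", "ae", "k"],
}


def word_internal_triphones(lexicon):
    """Triphones whose left and right contexts lie inside the same word."""
    tris = set()
    for phones in lexicon.values():
        for i in range(1, len(phones) - 1):
            tris.add((phones[i - 1], phones[i], phones[i + 1]))
    return tris


def cross_word_triphones(lexicon):
    """Triphones needed when any word may precede or follow any other."""
    tris = set(word_internal_triphones(lexicon))
    prons = list(lexicon.values())
    for prev, cur, nxt in product(prons, repeat=3):
        # Extend the word with the last phone of the previous word
        # and the first phone of the next word.
        padded = [prev[-1]] + cur + [nxt[0]]
        for i in range(1, len(padded) - 1):
            tris.add((padded[i - 1], padded[i], padded[i + 1]))
    return tris


if __name__ == "__main__":
    print(len(word_internal_triphones(LEXICON)), "word-internal triphones")
    print(len(cross_word_triphones(LEXICON)), "cross-word triphones")
```

Even with three words, the cross-word inventory is several times larger than the word-internal one, and most of the additional triphones depend on word pairs that a finite training set may never contain.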
The total number of triphones needed for any particular application depends on the phone set, the dictionary and the grammatical constraints. For example, there are about 12,600 position-independent triphones needed for the Resource Management task when using the standard word pair grammar and 20,000 when no grammar is used. For the 20k Wall Street Journal task, around 55,000 triphones are needed. However, only 6,600 triphones occur in the Resource Management training data and only 18,500 in the SI84 section of the Wall Street Journal training data.

Traditional methods of dealing with these problems involve sharing models across differing contexts to form so-called generalised triphones and using a posteriori smoothing techniques [5]. However, model-based sharing is limited in that the left and right contexts cannot be treated independently and hence this inevitably leads to sub-optimal use of the available data. A posteriori smoothing is similarly unsatisfactory in that the models used for smoothing triphones are typically biphones and monophones, and these will be rather too broad when large training sets are used. Furthermore, the need to have cross-validation data unnecessarily complicates the training process.

In previous work, a method of HMM estimation has been described which involves parameter tying at the state rather than the model level [10,12]. This method assumes that continuous density mixture Gaussian distributions are used and it avoids a posteriori smoothing by first training robust single Gaussian models, then tying states using an agglomerative data clustering procedure and finally, converting each tied state to a mixture Gaussian. This works well for systems which have only word internal triphone models and for which it is therefore possible to find some data for every triphone.
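The agglomerative tying step of the earlier data-driven method can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure of [10,12]: the variance-normalised distance, the complete-linkage merge rule, the `max_diameter` threshold, and the example state names and parameters are all hypothetical choices. The final conversion of tied states to mixture Gaussians is not shown.

```python
import math

# Hypothetical single-Gaussian states: name -> (mean vector, diagonal variance).
STATES = {
    "t-ae+k.s2": ([0.0, 1.0], [1.0, 1.0]),
    "k-ae+t.s2": ([0.1, 1.1], [1.0, 1.0]),
    "b-ae+t.s2": ([3.0, -2.0], [1.0, 1.0]),
}


def state_distance(a, b):
    """Variance-normalised distance between two diagonal Gaussians (assumed metric)."""
    (mu_a, var_a), (mu_b, var_b) = a, b
    total = sum(
        (ma - mb) ** 2 / math.sqrt(va * vb)
        for ma, mb, va, vb in zip(mu_a, mu_b, var_a, var_b)
    )
    return math.sqrt(total / len(mu_a))


def cluster_diameter(members, states):
    """Largest pairwise distance among the states in a candidate cluster."""
    return max(
        (state_distance(states[i], states[j])
         for i in members for j in members if i < j),
        default=0.0,
    )


def tie_states(states, max_diameter):
    """Greedily merge the pair of clusters giving the smallest merged diameter."""
    clusters = [[name] for name in states]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_diameter(clusters[x] + clusters[y], states)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > max_diameter:
            break  # any further merge would make a cluster too broad
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


if __name__ == "__main__":
    for members in tie_states(STATES, max_diameter=1.0):
        print("tied:", members)
```

Because this kind of clustering operates only on states that were actually trained, it has nothing to say about a triphone with no training data, which is the limitation noted at the end of the paragraph above.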
However, as indicated by the figures given above, systems which utilise cross-word triphones require data for a very large number of triphones and, in practice, many of them will be unseen in the training data. In this paper, the state tying approach is developed further to accommodate the construction of systems which have unseen triphones. The new system is based on the use of phonetic decision trees [1,2,6] which are used to determine contextually equivalent sets of HMM states. In order to be able to handle large training sets, the tree building is based only on the statistics encoded within

Similar resources

Robust decision tree state tying for continuous speech recognition

In this paper, methods of improving the robustness and accuracy of acoustic modeling using decision tree based state tying are described. A new two-level segmental clustering approach is devised which combines the decision tree based state tying with agglomerative clustering of rare acoustic phonetic events. In addition, a unified maximum likelihood framework for incorporating both phonetic and...

High accuracy acoustic modeling based on multi-stage decision tree

In many continuous speech recognition systems based on HMMs, decision tree-based state tying has been used for not only improving the robustness and accuracy of context dependent acoustic modeling but also synthesizing unseen models. To construct the phonetic decision tree, standard method has used just single Gaussian triphone models to cluster states. The coarse clusters generated using just ...

Comparing parameter tying methods for multilingual acoustic modelling

In this paper, we compare the state-level and model-level tying of continuous density hidden Markov models for the multilingual acoustic modelling. Using the model-level tying technique, the number of the language dependent (LD) phoneme models of five European languages were reduced to the desired number. This tying was based on dissimilarity measure between the LD phoneme models in a bottom-up...

Decision tree state tying based on penalized Bayesian information criterion

In this paper, an approach of penalized Bayesian information criterion (pBIC) for decision tree state tying is described. The pBIC is applied to two important applications. First, it is used as a decision tree growing criterion in place of the conventional approach of using a heuristic constant threshold. It is found that original BIC penalty is too low and will not lead to compact decision tre...

High resolution decision tree based acoustic modeling beyond CART

In this paper, an m-level optimal subtree based phonetic decision tree clustering algorithm is described. Unlike prior approaches, the m-level optimal subtree in the proposed approach is to generate log likelihood estimates using multiple mixture Gaussians for phonetic decision tree based state tying. It provides a more accurate model of the log likelihood variations in node splitting and it is...

Pruning of state-tying tree using bayesian information criterion with multiple mixtures

The use of context-dependent phonetic units together with Gaussian mixture models allows modern-day speech recognizer to build very complex and accurate acoustic models. However, because of data sparseness issue, some sharing of data across different triphone states is necessary. The acoustic model design is typically done in two stages, namely, designing the state-tying map and growing the numb...

Publication date: 1994